Abstract. Although taxonomy is often used informally to evaluate the results 
of phylogenetic inference and find the root of phylogenetic trees, algorithmic 
methods to do so are lacking. In this paper we formalize these procedures and 
develop algorithms to solve the relevant problems. In particular, we introduce a 
new algorithm that solves a "subcoloring" problem for expressing the difference 
between the taxonomy and phylogeny at a given rank. This algorithm improves 
upon the current best algorithm in terms of asymptotic complexity for the 
parameter regime of interest; we also describe a branch-and-bound algorithm 
that saves orders of magnitude in computation on real data sets. We also 
develop a formalism and an algorithm for rooting phylogenetic trees according 
to a taxonomy. All of these algorithms are implemented in freely-available 
software. 



1. Introduction 

Since the beginnings of phylogenetics, researchers have used a combination of 
phylogenetic inference and taxonomic knowledge to understand evolutionary re- 
lationships. Taxonomic classifications are often used to diagnose problems with 
phylogenetic inferences, and conversely, phylogeny is used to bring taxonomies up 
to date with recent phylogenetic inferences. Similarly, biologists often evaluate a 
putative "root" of a phylogenetic tree by looking at the taxonomic classifications 
of the subtrees branching off that node. 

This work is commonly done by hand. That is, researchers who have knowledge 
of the taxa in their phylogenetic tree use their knowledge of the taxonomy to root 
the tree and describe the level of taxonomic concordance. Despite the frequency 
with which these operations are done, we are not aware of any algorithms or software 
explicitly designed to address this problem. 

In this paper we propose algorithms to express the discord between the taxonomy 
and the phylogeny and to root the tree taxonomically. Our choice of algorithms will 
be guided by the parameter regime of relevance for modern molecular phylogenetics 
on marker genes: that of large bifurcating trees and a limited amount of discord 
with the taxonomy. 

We state the agreement problem between the taxonomy and phylogeny in terms 
of a "subcoloring" problem previously described in the computer science literature 
[10, 11]. As described below, we make algorithmic improvements over previous work 
in the relevant parameter regime and present the first computer implementation. 
For rooting, we show that the "obvious" definition has major defects when there 
is discordance between the phylogeny and the taxonomy at the highest 
multiply-occupied taxonomic rank. We then present a more robust alternative 
definition and algorithms that can quickly find a taxonomically-defined root. 

A related, but different, problem involves updating taxonomies based on phyloge- 
netic inferences. The most ambitious such project involves completely replacing the 
Linnean taxonomic system with a phylogenetic naming system, called PhyloCode 
[7] . This proposal has met with substantial resistance from the community [51 112] 
and does not appear to have gained wide acceptance as of 2011. Less ambitious 
but more widely accepted projects of this kind include the Bergey [9] and Greengenes [5] 
taxonomies; these have been curated to be more concordant with 16S phylogeny. 
We do not approach the updating problem here; rather, we are interested in the 
commonly encountered problem of a researcher inferring a phylogenetic tree and 
wishing to understand the level of agreement of that tree with the taxonomy at 
various ranks and wishing to root the tree taxonomically. 

2. Expressing the differences between the taxonomy and the phylogeny 

2.1. Informal introduction. In this paper we will consider agreement with the 
taxonomy one taxonomic rank at a time, in order to separate out the different 
factors that can lead to discord between taxonomy and phylogeny. These factors 
include phylogenetic methodology problems, horizontal gene transfer, lineage 
sorting, out-of-date taxonomic assignments, and mislabeling. Different factors lead 
to discordance at different ranks. For example, we have observed rampant 
mislabeling at the species level in public databases, whereas higher-level 
assignments are typically more accurate. Phylogenetic signal saturation or model 
mis-specification problems can lead to an incorrect branching pattern near the root 
of the tree at the higher taxonomic levels, even when the genus-level 
reconstructions are correct. 

An alternative to considering agreement one rank at a time would be to look for 
the largest set of taxa for which the induced taxonomy and phylogenetic tree agree 
on all levels. Agreement between taxonomy and phylogeny at all taxonomic ranks 
simultaneously is equivalent to requiring complete agreement of the phylogeny and 
taxonomic tree. Finding a subset of leaves on which two trees agree is known 
as Maximum Compatible Subtree (MCST), known to be polynomial for trees of 
bounded degree and NP-hard otherwise |H]. Although such a solution is useful in- 
formation, we have pursued a rank-by-rank approach here for the reasons described 
above. 

We formalize the agreement of taxonomy with the phylogeny on a rank-by-rank 
basis in terms of convex colorings [10, 11]. Informally, a convex coloring is an 
assignment of the leaves of a tree to elements of a set called "colors" such that 
the induced subtrees for each color are disjoint. In this paper we will say that 
a phylogeny agrees with the taxonomic classification at rank r if the taxonomic 
classifications at rank r induce a convex coloring of the tree. For example, in 
Figure 1 the tree is not convex on the species level due to nonconvexity between 
species $s_1$ and $s_2$, although it is convex on the genus level, as the $g_1$ and $g_2$ 
genera fall into distinct clades. In terms of the convex coloring definition, there 
is nontrivial overlap between the induced trees on $s_1$ and $s_2$. 

We will express the level of discord between the taxonomy and the phylogeny at 
a rank r in terms of the size of a maximal convex subcoloring. Given an arbitrary 
leaf coloring, a subcoloring is a coloring of some subset of the leaves S of the 
tree that agrees with the original coloring on the set S. The maximal convex 
subcoloring is the convex subcoloring of maximal cardinality. For a tree that is 
taxonomically labeled at the tips, the discord at rank r is defined to be the size 
of the maximal convex subcoloring when the leaves are colored according to the 
taxonomic classifications at rank r. 

Figure 1. A taxonomically labeled phylogenetic tree that is concordant 
with the genus level taxonomic assignments $g_i$ but not the 
species taxonomic assignments $s_i$. 

Our algorithmic contributions are twofold. First, by developing an algorithm 
that only investigates removing colors when such a removal could make a differ- 
ence, we show that the maximal convex subcoloring problem can be solved in a 
number of steps that scales in terms of a local measure of nonconvexity rather 
than the total number of nonconvex colors. Second, we implement a branch and 
bound algorithm that terminates exploration early; this algorithm makes orders of 
magnitude improvement in run times for difficult problems. 




Figure 2. Example colored trees showing the (i) original and (ii) 
strong definition of convexity, assuming a and b don't appear else- 
where in the tree. In this figure, (i) and (ii) are convex according 
to the original definition, but only (ii) is convex according to the 
strong definition. 

Before proceeding to outline how the algorithm works, note that the original 
definition of convexity is not the only way to formalize the agreement with a 
taxonomy at a given rank: for example, a stronger way of defining convexity is 
possible (Figure 2). In this stronger version, colors must sit in disjoint rooted 
subtrees rather than in disjoint induced subtrees. The algorithmic solution for this 
stronger version will be a special case of the previous one as described below. It 
may be of more limited use for two reasons. First, it depends on the position of 
the root: a tree that is strongly convex with one rooting may not be so with 
another. Second, it is not uncommon for phylogenetic algorithms to return a tree 
like that in Figure 2(ii) although Figure 2(i) may actually be the correct tree; 
thus an algorithm that discarded everything except the colors that are convex in 
the strong sense might be overly strict. 







Figure 3. A possible scenario encountered by the subcoloring 
recursion. The letters a, b, c and d represent the presence of leaves 
with those taxonomic labels; assume these taxonomic labels do 
not occur elsewhere in the tree. The positions of colors b and c 
show that this coloring is not convex, and a recursive subcoloring 
algorithm must decide at x in which subtrees to allow the b and c 
colors. 

The purpose of this first half of the paper is to derive efficient algorithms for the 
convex subcoloring problem in the parameter regime of interest: a limited amount 
of discord for large trees. The tree in Figure 3 serves to explain why the problem 
is combinatorially complex and motivates a recursive solution. The idea of this 
solution is to recursively descend through subtrees, starting at the root. Say this 
recursion has descended to an internal node x, and there are nodes of the color c 
somewhere above x. If the set of leaf colors in the subtree $T_1$ is {b, c} and if the 
set of leaf colors in the subtree $T_2$ is {a, b, d}, then some removal of colors is 
needed due to nonconvexity between the b and c colors. Assuming the leaf colors 
above x are fixed, the choices are to uncolor the c nodes in $T_1$ or the b nodes in 
either $T_1$ or $T_2$. 

One can think of "allocating" these colors to the subtrees: the possible choices 
are to allocate c to $T_1$ but choose one of $T_1$ or $T_2$ to have b, or to disallow c 
in $T_1$ and allow b in both $T_1$ and $T_2$. Here and in general, the crux of devising 
an efficient recursion is to efficiently decide which colors get allocated to which 
subtrees. Convexity can be ensured by explicitly choosing a color for each internal 
node, and making sure that the color allocations respect those choices in terms of 
convexity. 

In fact, selecting these color allocations is the only problem, as a complete set 
of color allocations is trivially equivalent to a choice of coloring. Indeed, given 



RECONCILING TAXONOMY AND PHYLOGENETIC INFERENCE 



an optimal color allocation for each internal node, one can simply look at the 
allocations for the internal nodes just above leaves to decide whether those leaves 
get uncolored or not. Conversely, given a leaf coloring, one can simply look at the 
color set of the descendants of that internal node to get the set of colors allocated 
to the subtrees. 

In deciding the color allocations, we can restrict our serious attention to colors 
that are present on either side of an edge, such as b and c on either side of $T_1$'s 
root edge in Figure 3. We will say that these colors are cut by the edge. Colors that 
are not cut by an edge should not require any decision making when the recursive 
algorithm is visiting the node just above that edge. However, doing the accounting 
is not completely straightforward: of the cut colors, one might only allocate b to 
$T_2$, but a and d can be used as well. Thus, allocations including some non-cut 
colors must be considered, motivating the definition of the colors in play 
(Definition 10). 

Note that the ingredients of the decision being made in Figure 3 are the color 
specified by the coloring just above x (in this case fixed to be c), and the colors 
available in the subtrees below x. Given a set of colors to allocate to the leaves 
below x, the algorithm needs to decide how to allocate the possible colors to $T_1$ 
and $T_2$. One way of doing that is to test every possible allocation using the 
previous results of the recursion and score them in terms of the size of the 
corresponding subcoloring. Doing this with an awareness of the cut colors leads to 
an algorithm expressed in terms of the maximum number of colors cut by a given edge. 

However, building such a comprehensive optimality map is not necessary. By 
simply counting the number of leaves of each color below x, one can get upper 
bounds on the sizes of the corresponding subcolorings and only evaluate those that 
have the potential to be worth exploring. This observation is the basis of the branch 
and bound algorithm (Algorithm 1). 

2.2. Definitions and algorithms. A rooted subtree is a subtree that can be 
obtained from a rooted tree T by removing an edge of T and taking the component 
that does not contain the original root of T. The proximal direction in a rooted 
tree is towards the root, while the distal direction is away from the root. Given a 
tree T, let $N(T)$, $E(T)$, and $L(T)$ denote the nodes, edges, and leaves of T. Given 
a set U, let $2^U$ denote the set of subsets of U. If the tree is not rooted, root it 
arbitrarily. 

Following the terminology of [10, 11], a color set will be an arbitrary finite set. 

Definition 1. Let T be a rooted tree, and let $F \subseteq L(T)$. A leaf coloring is a map 
$\chi : F \to C$, where C is a color set. 

A color c is cut by an edge e if there is at least one leaf of color c on each side 
of e. A multicoloring is defined to be a map from the edges of the tree to subsets 
of the colors. 

Definition 2. Given a coloring $\chi$ on a rooted tree T, an induced multicoloring is 
the map $\tilde{\chi} : E(T) \to 2^C$ such that $\tilde{\chi}(e)$ is the (possibly empty) set of 
colors cut by that edge. 

Definition 3. Define the badness $\beta(\chi)$ of a coloring $\chi$ to be 
$\max_{e \in E(T)} |\tilde{\chi}(e)|$. We say that a coloring is convex if it has badness 
equal to zero or one. 

Definition 4. The total number of bad colors is 
$\tau(\chi) = \left| \bigcup_{e \in \mathcal{E}} \tilde{\chi}(e) \right|$, where $\mathcal{E}$ is 
the set of edges where $|\tilde{\chi}(e)| \geq 2$. 
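
As an illustration of Definitions 2 to 4, the following Python sketch computes the colors cut by each edge, the badness, and the total number of bad colors for a small leaf-colored tree. The dictionary-based tree encoding and the names `cut_colors`, `badness`, and `total_bad_colors` are assumptions made for this example; they are not taken from the rppr implementation.

```python
from collections import Counter

def cut_colors(children, root, coloring):
    """Map each non-root node v (standing for the edge above v) to the
    set of colors that have leaves on both sides of that edge."""
    total = Counter(coloring.values())
    below = {}

    def count(v):
        # Count colored leaves of each color in the subtree rooted at v.
        c = Counter()
        if v in coloring:
            c[coloring[v]] += 1
        for w in children.get(v, []):
            c += count(w)
        below[v] = c
        return c

    count(root)
    return {v: {col for col, k in c.items() if 0 < k < total[col]}
            for v, c in below.items() if v != root}

def badness(cut):
    """Largest number of colors cut by any single edge."""
    return max((len(s) for s in cut.values()), default=0)

def total_bad_colors(cut):
    """Number of distinct colors cut by edges that cut at least two colors."""
    bad = set()
    for s in cut.values():
        if len(s) >= 2:
            bad |= s
    return len(bad)

# Hypothetical four-leaf tree: root r with children x and y.
children = {"r": ["x", "y"], "x": ["l1", "l2"], "y": ["l3", "l4"]}
mixed = {"l1": "a", "l2": "b", "l3": "a", "l4": "b"}
cut = cut_colors(children, "r", mixed)
print(badness(cut), total_bad_colors(cut))  # prints "2 2", so not convex
```

Under this encoding a coloring is convex exactly when `badness` returns zero or one.
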
Definition 5. A subcoloring of a leaf coloring $\chi : F \to C$ is a coloring $\chi' : G \to C$ 
with $G \subseteq F$ such that $\chi'$ agrees with $\chi$ on the domain of $\chi'$. 

These are partially ordered by inclusion of domains; the size of a subcoloring is 
defined to be the size of its domain. 

Problem 1. Given a leaf coloring $\chi$ on a tree T, find a largest convex subcoloring. 
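
For very small trees, Problem 1 can be solved by exhaustive search, which is useful as a reference when testing faster algorithms. The sketch below is one such brute-force approach, not the algorithm developed in this paper: it tries subsets of colored leaves from largest to smallest and returns the first convex one, using the criterion that no edge cuts two or more colors. The tree encoding and the function names are assumptions made for this example.

```python
from itertools import combinations

def is_convex(children, root, coloring):
    """True if no edge of the tree cuts two or more colors."""
    total = {}
    for col in coloring.values():
        total[col] = total.get(col, 0) + 1
    convex = True

    def count(v):
        nonlocal convex
        c = {}
        if v in coloring:
            c[coloring[v]] = 1
        for w in children.get(v, []):
            for col, k in count(w).items():
                c[col] = c.get(col, 0) + k
        # A color is cut by the edge above v if only some of its leaves are below v.
        n_cut = sum(1 for col, k in c.items() if k < total[col])
        if v != root and n_cut >= 2:
            convex = False
        return c

    count(root)
    return convex

def largest_convex_subcoloring(children, root, coloring):
    """Exhaustive search; exponential in the number of colored leaves."""
    leaves = sorted(coloring)
    for size in range(len(leaves), -1, -1):
        for kept in combinations(leaves, size):
            sub = {leaf: coloring[leaf] for leaf in kept}
            if is_convex(children, root, sub):
                return sub
    return {}

# Hypothetical four-leaf tree in which colors a and b interleave.
children = {"r": ["x", "y"], "x": ["l1", "l2"], "y": ["l3", "l4"]}
coloring = {"l1": "a", "l2": "b", "l3": "a", "l4": "b"}
best = largest_convex_subcoloring(children, "r", coloring)
print(len(best))  # prints 3: uncoloring a single leaf restores convexity
```
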

2.2.1. Previous work and motivation for present algorithm. The foundational work 
in this area was done by Moran and Snir [10, 11]. Their work is phrased in terms 
of "convex recoloring," i.e., finding the minimal number of changes in a coloring in 
order to obtain one that is convex. 

It suffices to consider subcolorings for the case of leaf colorings. Indeed, any 
recoloring can be turned into a subcoloring by removing the color of all of the 
leaves that get recolored. Conversely, any convex subcoloring can be turned into 
a convex recoloring in linear time [4]. For internal nodes, the color to be used for 
a given internal node is given by the definition of convex coloring. For leaf nodes, 
simply take the color of the closest colored node. In this equivalence, the number 
of leaves whose color is removed is equal to the number of leaves that get recolored; 
thus a minimal recoloring is equivalent to a maximal subcoloring. We only consider 
subcolorings in this paper. 

In [11], Moran and Snir investigate both the general case of leaf colorings as well 
as the case of colorings including internal nodes. They also consider non-uniform 
recoloring cost functions, where a "cost" is associated with recoloring individual 
nodes and the goal is to find a convex recoloring minimizing total cost. In all 
settings, they demonstrate that the relevant recoloring problem is NP-hard. They 
also demonstrate fixed parameter tractability (FPT) of the problems, as described 
in the next paragraph. In [10] they present, among other results, a 3-approximation 
for tree recoloring. 

The FPT bound for leaf coloring from [11], $O(n^4 \tau\, \mathrm{Bell}(\tau))$, comes from an 
elegant argument using the Hungarian algorithm for maximum weight perfect 
matching on a bipartite graph. In fact, an inspection of their proof reveals that their 
algorithm is $O(n d^3 \tau\, \mathrm{Bell}(\tau))$, where d is the maximum degree of the tree. 
$\mathrm{Bell}(k)$ denotes the kth Bell number, which is the number of unordered partitions 
of k items; these numbers are known to satisfy the bounds 
$\left(\frac{k}{e \ln k}\right)^k < \mathrm{Bell}(k) < \left(\frac{k}{\ln k}\right)^k$ 
[3]. Their recursion at a given internal node iterates over every unordered partition 
of the nonconvex colors, constructing a bipartite graph with edge weightings 
determined from the sizes of subcolorings of subtrees using those color sets for the 
set of excluded colors. Applying the Hungarian algorithm to each such graph results 
in optimal solutions for every possible set of colors to exclude from the subtree at 
that internal node. Because every unordered partition of the nonconvex colors is 
considered, the algorithm is exponential in $\tau$. For the case of general (i.e., not just 
leaf) colorings, Moran and Snir show that a dynamic program gives an $O(n \tau d^{\tau+2})$ 
algorithm. This of course also gives the same bound for leaf colorings. 
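
To give a feel for how quickly the Bell numbers in these bounds grow, the following Python sketch computes them with the standard Bell triangle recurrence. The function name `bell_numbers` is an assumption made for this example.

```python
def bell_numbers(n_max):
    """Return [Bell(0), ..., Bell(n_max)] using the Bell triangle."""
    row = [1]
    bells = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to its left neighbour.
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

print(bell_numbers(10))
# prints [1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]
```

Already at ten nonconvex colors there are 115975 unordered partitions to consider, which is why a parameter smaller than the total number of bad colors is attractive.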

The work of Moran and Snir was followed up by many authors. For leaf-colored 
trees, Bachoore and Bodlaender [5] propose a collection of reductions to simplify 
the problem under investigation. These reductions encode some of the logic of the 
algorithm presented here, such as that trees that have disjoint color sets can be 
solved independently. They also use the fact that nonconvexity can be expressed 
in terms of the crossings of paths connecting leaves of the same color to show that 
the recoloring problem can be solved in $O(n 4^{\mathrm{OPT}})$ time, where OPT is the optimal 
number of uncolored leaves. Note that this sort of bound is different from those 
described above, as OPT can get large even when the total number of bad colors is 
small. The work for the general case culminated in the work of Ponta, Hüffner, and 
Niedermeier [13], who use the childwise iterative approach to dynamic programming 
to construct an algorithm of complexity $O(n \tau 3^{\tau})$. 
Figure 4. The relationship between $\beta$, a local measure of nonconvexity, 
and $\tau$, a global measure, for our example data set. Each 
point represents a single phylogenetic tree with taxonomic assignments 
at a given rank. 

For trees built from real data, taxonomic identifiers are not randomly spread 
across the tree in a uniform fashion. For example, species-level mislabeling will 
lead to trees that are mostly convex with a couple of outliers, while a horizontal 
gene transfer will effectively "transplant" one clade into another. In both of these 
situations there is a non-uniform distribution of taxonomic identifiers across the 
tree, and nonconvexity in these cases may be local. Indeed, in Figure 4 we show 
the relationship between the badness $\beta$ and the total number of bad colors $\tau$ for our 
example trees, showing that the badness $\beta$ is significantly smaller than the total 
number of bad colors on a collection of phylogenetic trees for functional genes. This 
motivates the search for a fixed parameter tractable algorithm that is exponential 
in $\beta$ rather than $\tau$. 
Furthermore, phylogenetics is typically concerned with a setting of trees with 
small degree. For example, many commonly used phylogenetic inference packages 
such as RAxML TF and FastTree [T3] only return bifurcating trees; these sorts of 
programs are the intended source of trees for our convexify algorithm. Even when 
multifurcations are allowed, the setting of interest is that of degree much smaller 
than $\beta$ or $\tau$, which has ramifications for algorithm choice as described below. 

2.2.2. Algorithm. In this section we present our algorithm, which makes two 
improvements over previous work for the parameter regime of interest. First, it only 
evaluates relevant colorings by restricting attention to cut colors, resulting in an 
algorithm that is exponential in $\beta$ rather than $\tau$. Still, such a complete recursion 
evaluates many sub-solutions that do not end up being used. Because the problem 
is NP-hard, we cannot avoid some such evaluation, but we might hope to do better 
than evaluating everything. 

This motivates the second aspect, a branch and bound strategy that can make 
orders of magnitude improvements in the run time of the algorithm. In order to 
make the branch and bound algorithm possible, the algorithm enumerates all legal 
color allocations first, and ranks them according to the upper bound function. By 
bounding the size of a solution for a given color allocation, we can avoid fully 
evaluating the sub-solution for that almost partition. A simple way of bounding 
the size of a solution for a color allocation is the maximal size of the solution when 
convexity is ignored. 

Definition 6. Given a rooted subtree T' of T, define $\kappa(T')$ to be the colors of $\chi$ 
cut by the root edge of T' as it sits inside T. 

Assume that T has been embedded in the plane, and that every internal node 
has been uniquely labeled. For every such label i let $t(i)$ be the ordered tuple of 
labels below i in the tree, let $T(i)$ be the subtree below i, and use $\kappa(i)$ as shorthand 
for $\kappa(T(i))$. Vector subscript notation will be used to index both $t(i)$ and the color 
set k-tuples defined next; i.e., $t(i)_j$ is the jth entry of $t(i)$. 

Definition 7. A color set k-tuple is an ordered k-tuple of subsets of C. They will 
be denoted by $\pi$. 

These color set k-tuples will represent the allocation of colors to subtrees. We 
will ensure convexity of these color allocations using the following two definitions. 

Definition 8. Given a color set k-tuple $\pi$, define 

$$A(\pi) = \bigcup_{i} \pi_i$$ 

and 

$$B(\pi) = \bigcup_{i<j} (\pi_i \cap \pi_j).$$ 

Definition 9. An almost partition of $Y \subseteq C$ is an ordered 2-tuple $(b, \pi)$ where 
$b \in C$ and $\pi$ is a color set k-tuple such that $A(\pi) = Y$ and $B(\pi) \subseteq \{b\}$. 
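
The almost partition condition is easy to check mechanically. The Python sketch below assumes that $A(\pi)$ is the union of the entries of $\pi$ and that $B(\pi)$ is the union of their pairwise intersections; the function names are hypothetical and the color sets are modeled on the scenario of Figure 3.

```python
from itertools import combinations

def union_all(pi):
    """A(pi): every color allocated to at least one subtree."""
    out = set()
    for part in pi:
        out |= set(part)
    return out

def pairwise_overlap(pi):
    """B(pi): every color allocated to two or more subtrees."""
    out = set()
    for p, q in combinations(pi, 2):
        out |= set(p) & set(q)
    return out

def is_almost_partition(b, pi, Y):
    """True if pi covers exactly Y and only the node color b may be shared."""
    return union_all(pi) == set(Y) and pairwise_overlap(pi) <= {b}

# Node color c: color b may go to only one of the two subtrees.
print(is_almost_partition("c", [{"b", "c"}, {"a", "d"}], {"a", "b", "c", "d"}))       # True
print(is_almost_partition("c", [{"b", "c"}, {"a", "b", "d"}], {"a", "b", "c", "d"}))  # False
# Node color b: color b may be shared between the subtrees once c is dropped.
print(is_almost_partition("b", [{"b"}, {"a", "b", "d"}], {"a", "b", "d"}))            # True
```
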

These will be the color allocations at a given internal node x with color b; this 
definition guarantees convexity locally. As described in the informal introduction, 
we would like to restrict our attention to cut colors, but this requires some care as 
not all of the colors used are cut. This motivates the following definition, which 
describes how the colors that are not explicitly excluded as the complement of X 
are the ones available in the subtrees below i. 

Definition 10. Given i an internal node index and $X \subseteq \kappa(i)$, define 



$G(i, X)$ will be called the colors in play for $(i, X)$. 

Definition 11. Assume we are given an internal node i, $X \subseteq \kappa(i)$, and $c \in C$. A 
legal color allocation for $(i, c, X)$ is an almost partition $(b, \pi)$ of $G(i, X)$ such that 

(1) $\pi_j \subseteq \kappa(t(i)_j)$ 

(2) if $c \in X$ then $b = c$. 

Denote the set of such legal color allocations with $A(i, c, X)$, and let 
$A(i) = \bigcup_{c, X} A(i, c, X)$. 

These color allocations are exactly the set of choices that are allowed when 
developing a subsolution for a cut set X such that the color c is just above X. 
The first condition ensures that the color choice sits inside the correct set of cut 
colors. The second condition says that an internal node must take on any color 
found above and below it. 

Definition 12. An implicit subcoloring for T' is a choice of $(b(i), \pi(i)) \in A(i)$ for 
every $i \in N(T')$ satisfying the following compatibility property for every $k \in t(i)$: 

$$(b(k), \pi(k)) \in A(k, b(i), \pi(i)_k).$$ 

That is, the color allocation for every node descending from i is a legal color 
allocation given the choices of $b(i)$ and $\pi(i)$ made at i. 

As described in the informal introduction, an implicit subcoloring defines an actual 
subcoloring via the implicit subcoloring just proximal to leaf nodes. Indeed, say 
$t(i)_j$ is a leaf, and that $(b(i), \pi(i))$ is the color allocation for internal node i. Then 
$\pi(i)_j$ is empty or a single element by definition; the color for leaf $t(i)_j$ is used in 
the subcoloring if $\pi(i)_j \neq \emptyset$. Furthermore, every convex subcoloring can be 
written in this form. 

Proposition 1. Implicit subcolorings are convex. 

Proof. Assume an implicit subcoloring $\{(b(i), \pi(i))\}_{i \in N(T)}$. Let $\chi$ be the coloring 
coming from this implicit subcoloring. If $\chi$ is not convex, then there is an edge e 
such that $|\tilde{\chi}(e)| \geq 2$. Say $\{a, b\} \subseteq \tilde{\chi}(e)$. Without loss of generality, the colors 
will be positioned as in one of the two cases depicted in Figure 5. In case (i), 
$|B(\pi(i))| \geq 2$, contradicting the definition of an almost partition. In case (ii), 
$b(i_1)$ is a by the definition of almost partition because $a \in B(\pi(i_1))$. Then 
$b(i) = a$ for every i between $i_1$ and $i_2$, inclusive, by part 2 of Definition 11 and 
Definition 12. However, $b \in B(\pi(i_2))$, contradicting the definition of almost 
partition. □ 

With this in mind, we can now speak of the size of an implicit subcoloring as 
the size of its associated convex subcoloring. The goal, then, is to find the largest 
implicit subcoloring. 

Definition 13. Given internal node i and $(b, \pi) \in A(i)$, a partial solution for 
$(b, \pi)$ is an implicit subcoloring for $T(i)$ of maximal size such that the choice of 
almost partition for node i is $(b, \pi)$. 
Figure 5. Two potential settings for nonconvexity along an edge 
e in the proof of Proposition 1. 

Theorem 1. There is an $O(n \beta\, 2^{\beta + d} (d-1)^{d\beta/2})$ complexity algorithm to solve the 
subcoloring problem for leaf labeled trees. 

Proof. For every internal node i, define the question domain $Q(i)$ to be $C \times 2^{\kappa(i)}$. 
An answer map at internal node i (resp. answer size map) is a map $Y \to A(i)$ 
(resp. $Y \to \mathbb{N}$) for some $Y \subseteq Q(i)$. 

We will fill out an answer map $\varphi_i$ and an answer size map $\omega_i$ as needed at every 
internal node i as follows. For a given i, say we are given a question $(c, X) \in Q(i)$. 
If i is a leaf, then $\varphi_i(c, X) = X$ and $\omega_i(c, X) = |X|$. Otherwise, say there are $\ell$ 
descendants of i. For each $(b, \pi) \in A(i, c, X)$, find the answers $\varphi_{t(i)_j}(b, \pi_j)$ and 
their associated $\omega_{t(i)_j}$ by recursion. Let 

$$\omega_i(b, \pi) = \sum_{j=1}^{\ell} \omega_{t(i)_j}(b, \pi_j).$$ 

Let $\omega_i(c, X)$ be the maximum value of $\omega_i(b, \pi)$ for $(b, \pi) \in A(i, c, X)$, and let 
$\varphi_i(c, X)$ be the $(b, \pi)$ obtaining this maximum. The result of this recursion after 
starting at the root with every color for c will be a collection of answer maps 
for every i. 

These maps define an implicit subcoloring. This can be seen by descending 
through the tree recursively, using the $\varphi_i$ to get almost partitions from questions 
and passing the resulting questions onto subtrees. Specifically, for question $(c, X)$ 
at internal node i, let $(b(i), \pi(i)) := \varphi_i(c, X)$, then recur by passing question 
$(b(i), \pi(i)_j)$ to $\varphi_{t(i)_j}$ for every descendant j. Start at the root, with index $\rho$, 
pick the color $c_\rho$ maximizing $\omega_\rho(c_\rho, \emptyset)$, and begin the recursion with 
$(c_\rho, \emptyset)$. 

This subcoloring is maximal. Assume any other convex subcoloring; this alternate 
subcoloring defines a question $(c, X)$ for each nonroot internal node i, where c 
is the color of the edge above internal node i induced by the definition of convexity 
and $X \subseteq \kappa(i)$ is the set of cut colors that have leaves of that color below i. The 
collection of such questions also defines an answer $\varphi'_i(c, X)$ for every such $(c, X)$; by 
construction, the corresponding subsolution cannot be larger than that for $\varphi_i(c, X)$. 

The complexity estimate for a single internal node is as follows. The number of 
colors in play is bounded above by $d\beta/2$, as each color in play must be cut in at 
least two edges. Thus, for a given question $(c, X)$, choosing the allocation can take 
$(d-1)^{d\beta/2}$ steps for the colors other than c, while deciding where c is present can 
take $2^d$ trials. For a given internal node there are at most $\beta\, 2^{\beta}$ choices of question, 
giving the bound. □ 

An upper bound for $\omega$ can be used to construct a branch and bound recursion 
as follows. 

Algorithm 1 (Branch and bound recursion to find optimal implicit subcoloring). 

Assume a function $\nu_i(b, \pi) \geq \omega_i(b, \pi)$ for all $(b, \pi) \in A(i)$. Proceed as in the proof 
of Theorem 1 with the following modification. For a given internal node i with 
$c \in C$ and $X \subseteq \kappa(i)$, find $\varphi_i(c, X)$ as follows: 

(1) sort the elements $(b, \pi)$ of $A(i, c, X)$ in decreasing order with respect to $\nu_i$. 

(2) proceed down this ordered list as follows, starting with q = 0: 

(a) compute $\omega_i(b, \pi)$; if $q < \omega_i(b, \pi)$ then set $q = \omega_i(b, \pi)$ 

(b) call the next item in the ordered list $(b', \pi')$. If $q \geq \nu_i(b', \pi')$ then stop, 
otherwise recur to (a) 

(3) let $\varphi_i(c, X)$ be the $(b, \pi)$ corresponding to q. 

A simple upper bound is the number of leaves that could be used given the 
restrictions in $\pi$ but ignoring convexity. That is, let $\nu_i(X)$ be the number of leaves of 
$T(i)$ with colors in X. Then define $\nu_i(b, \pi) = \sum_j \nu_{t(i)_j}(\pi_j)$. This upper bound gives 
significant improvement in time used over the algorithm in Theorem 1 (Figure 6). 
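
The pruning step of the branch and bound recursion can be sketched independently of the tree recursion. In the Python example below, `candidates` stands in for the legal color allocations at one node, `upper_bound` for a cheap bound such as the leaf count above, and `evaluate` for the expensive recursive computation of the true size. All three names, and the toy numbers, are assumptions made for this illustration rather than parts of the rppr code.

```python
def branch_and_bound(candidates, upper_bound, evaluate):
    """Return (best candidate, its size, number of full evaluations).

    Candidates are visited in decreasing order of their upper bound, and
    the search stops as soon as the best size found so far is at least
    the upper bound of the next candidate."""
    ranked = sorted(candidates, key=upper_bound, reverse=True)
    best, best_size = None, 0
    evaluated = 0
    for cand in ranked:
        if best is not None and best_size >= upper_bound(cand):
            break  # no remaining candidate can beat the incumbent
        size = evaluate(cand)
        evaluated += 1
        if best is None or size > best_size:
            best, best_size = cand, size
    return best, best_size, evaluated

# Toy candidates: (name, upper bound, true size), with true size <= upper bound.
toy = [("p1", 10, 7), ("p2", 9, 8), ("p3", 8, 5), ("p4", 6, 6)]
best, size, n_eval = branch_and_bound(toy, lambda c: c[1], lambda c: c[2])
print(best[0], size, n_eval)  # prints "p2 8 2": p3 and p4 are never evaluated
```

The savings depend entirely on how tight the upper bound is; a loose bound degrades gracefully to the full enumeration.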

2.3. Computer implementation. The original algorithm described in Theorem 1 
and the branch and bound algorithm in Algorithm 1 have been implemented in the 
rppr binary of the pplacer suite of programs, http://matsen.fhcrc.org/pplacer/. 
The code is written in OCaml [1], an appropriate choice as it has $O(\log n)$ 
immutable set operations in the standard library. The input can either be a 
"reference package" containing both taxonomic and phylogenetic information, or 
simply a phylogenetic tree along with a comma separated value file specifying the 
color assignments. Our implementation has been validated using an independent 
"brute-force" implementation in Python; the two codes return identical results on a 
testing corpus consisting of all colorings on all trees of three to eight leaves with up 
to six colors. The algorithm is called via a single command line call, which outputs 
a list of uncolored taxa for every nonconvex taxonomic rank as well as displaying 
them on a taxonomically labeled tree by highlighting them in red. 

One time saving difference between the implementation and the algorithm from 
the previous section is that the computer implementation has a notion of "no color." 
This is motivated by the fact that in the case when $c \notin X$ and $B(\pi)$ is empty for 
an internal node i, there are a number of colorings of i that will provide a convex 
subtree. By collapsing all of the possible colors into a single "no color" in this case, 
we gain some savings in time and memory. 

The "no color" version of the algorithm can also be used to solve the case of strong 
convexity described in the informal introduction. Specifically, restricting every 
internal node to have no color except for the internal nodes of subtrees that consist 
entirely of one color leads to an algorithm for strong convexity. This strong convexity 
version is available via a command line flag. 

The data set used as a test set was a collection of 100 trees built by automatically 
recruiting sequences via a BLAST search via HMMs built from COG 
[16] alignments. Taxonomic identifiers for the various ranks were found using the 
taxtastic software, available at http://github.com/fhcrc/taxtastic. Each trial 
was run three times and the results averaged; if any of the runs did not finish in 
8 hours, exceeded 16G RAM usage, or encountered a stack overflow, the trial was 
marked as "DNF." Every trial that completed according to these criteria using the 
full recursion also completed using the branch-and-bound. Colored trees with 
badness strictly greater than 14 were excluded from Figure 6, as were trials that did 
not complete using either algorithm. The full recursion and the branch and bound 
implementations only differ by a switch that controls if the algorithm terminates 
early. Trials were run on Intel Xeon (X5650) cluster nodes with 48G of RAM. This 
test data set is available upon request. 

Figure 6. Runtime comparison between the full recursion (Theorem 1) 
and the branch and bound (Algorithm 1). "DNF" means 
that the full recursion did not finish in the time and memory allotted. 
Symbols are colored according to their badness $\beta$, ranging from 4 
(blue) to 14 (red). 

3. Taxonomic rooting 

Researchers generally like to root phylogenetic trees in a way that the progression 
along edges from the root to the leaves is one of descent. There are a number of ways 
of achieving this, from using outgroups to using non-reversible models of mutation 
[17]. However, there has been surprisingly little work on one of the most commonly 
used informal means of rerooting, which is using taxonomic classifications. By 
this, we mean looking for a location in the tree such that the leaf sets of the 
subtrees each have a single taxonomic classification at the highest taxonomic rank 
that contains multiple taxonomic identifiers. Here we formalize this process and 
describe algorithms for finding the taxonomic root or roots. 

Definition 14. A rank function for a set $U$ is a map $\mathrm{rk} : 2^U \to \mathbb{N}$ such that 

$$\mathrm{rk}(A \cup B) \ge \max(\mathrm{rk}(A), \mathrm{rk}(B))$$

for all $A$ and $B$ in $2^U$. 

It follows immediately that $\mathrm{rk}(A) \le \mathrm{rk}(B)$ when $A \subseteq B \in 2^U$. By an abuse 
of notation, we also let $\mathrm{rk}(T)$ signify $\mathrm{rk}(L(T))$ for a (sub)tree $T$ with leaf set in 
the domain of the rank function. From a taxonomic perspective, $\mathrm{rk}(U)$ will 
represent the rank of the most specific taxonomic classification containing all of the 
taxonomic labels in $U$. For example, the rank of a set consisting of a genus-level 
taxonomic assignment and an order-level one is the rank of order. For this section, 
a taxonomically labeled phylogenetic tree is one for which we have a rank function 
on the leaves. 
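As an illustration of Definition 14, the following Python sketch builds a rank function from taxonomic lineages. The numeric encoding of ranks, the helper names `shared_prefix_length` and `make_rank_function`, the convention for the empty set, and the example lineages are all assumptions made for this sketch; they are not taken from the software described in this paper.

```python
# Toy rank function in the sense of Definition 14 (illustrative assumptions only).
LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]
# Rank numbers: 0 = species, ..., 6 = domain, 7 = no shared classification.
RANK_NAMES = LEVELS[::-1] + ["root"]


def shared_prefix_length(lineages):
    """Number of leading taxonomic names shared by every lineage."""
    length = 0
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        length += 1
    return length


def make_rank_function(lineage_of):
    """Return rk, mapping a set of labels to the rank of their most
    specific common classification (larger number = less specific)."""
    def rk(labels):
        lineages = [lineage_of[label] for label in labels]
        if not lineages:
            return 0  # convention chosen here for the empty set
        return len(LEVELS) - shared_prefix_length(lineages)
    return rk


# Hypothetical lineages, listed from domain downwards.
LINEAGE_OF = {
    "Escherichia": ("Bacteria", "Proteobacteria", "Gammaproteobacteria",
                    "Enterobacterales", "Enterobacteriaceae", "Escherichia"),
    "Enterobacterales": ("Bacteria", "Proteobacteria", "Gammaproteobacteria",
                         "Enterobacterales"),
    "Bacillus": ("Bacteria", "Firmicutes", "Bacilli", "Bacillales",
                 "Bacillaceae", "Bacillus"),
}
rk = make_rank_function(LINEAGE_OF)

# A genus-level label together with an order-level label has the rank of order.
genus_and_order = rk({"Escherichia", "Enterobacterales"})
print(RANK_NAMES[genus_and_order])   # order
print(RANK_NAMES[rk(LINEAGE_OF)])    # domain
```

Because adding labels can only shorten the shared prefix, this toy `rk` satisfies the inequality of Definition 14 by construction.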

Given $x$ a node of $T$, let $\Psi(x; T)$ represent the trees obtained by deleting $x$ 
from $T$. 

Definition 15. Define the subrank $\mathrm{subrk}(x; T)$ to be $\max_{S \in \Psi(x; T)} \mathrm{rk}(S)$, the 
maximum rank of the subtrees of $T$ when rooted at $x$. We will say $x$ is a delicate 
taxonomic root of $T$ if 

$$\mathrm{subrk}(x; T) = \min_{y \in N(T)} \mathrm{subrk}(y; T).$$

This definition formalizes an intuitive definition of taxonomic root. For example, 
imagine that we have a tree with the three domains of cellular life in three distinct 
subtrees: Bacteria, Archaea, and Eucaryota; call the internal node that sits between 
these subtrees x. The subrank of x is domain. Rooting at any other internal node 
leaves a subtree containing more than one domain, so its subrank is strictly higher 
than domain. 
In this case, x is the unique taxonomic root. 

However, if the tree is not convex at the subrank of the delicate taxonomic root 
then every internal node will be a delicate taxonomic root; thus the "delicate" 
terminology. Indeed, assume an internal node $y$ and $A, B \in \Psi(y; T)$ such that 
$a_1, a_2 \in L(A)$, $b_1, b_2 \in L(B)$, and $\mathrm{subrk}(a_1, a_2) = \mathrm{subrk}(b_1, b_2) = \mathrm{subrk}(T)$. Then 
for any rooting, there must exist a subtree containing either $\{a_1, a_2\}$ or $\{b_1, b_2\}$, 
and the subrank must be equal to that of $T$. 
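The subrank and the delicate taxonomic root can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the tree is an unrooted adjacency list, leaves carry hypothetical lineage tuples with placeholder phylum names, the toy `rk` counts shared lineage prefixes, and the minimum in Definition 15 is taken over internal nodes only. None of the function names come from the authors' software.

```python
LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def rk(lineages):
    """Toy rank: 7 minus the number of leading names shared by all lineages."""
    shared = 0
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        shared += 1
    return len(LEVELS) - shared


def adjacency(edges):
    """Undirected adjacency lists for an unrooted tree given as edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def leaves_beyond(adj, x, y):
    """Leaves of the subtree that contains neighbor y once x is deleted."""
    stack, seen, leaves = [y], {x, y}, []
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:
            leaves.append(node)
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return leaves


def subrank(adj, lineage_of, x):
    """subrk(x; T): maximum rank over the subtrees obtained by deleting x."""
    return max(rk([lineage_of[leaf] for leaf in leaves_beyond(adj, x, y)])
               for y in adj[x])


def delicate_roots(adj, lineage_of):
    """Internal nodes attaining the minimum subrank (Definition 15)."""
    internal = [v for v in adj if len(adj[v]) > 1]
    scores = {v: subrank(adj, lineage_of, v) for v in internal}
    best = min(scores.values())
    return sorted(v for v in internal if scores[v] == best)


# Three two-leaf subtrees hanging off a central node x.
EDGES = [("x", "b"), ("x", "a"), ("x", "e"),
         ("b", "b1"), ("b", "b2"), ("a", "a1"), ("a", "a2"),
         ("e", "e1"), ("e", "e2")]
ADJ = adjacency(EDGES)
# Hypothetical domain/phylum labels; the phylum names are placeholders.
CONVEX = {"b1": ("Bacteria", "P1"), "b2": ("Bacteria", "P2"),
          "a1": ("Archaea", "P3"), "a2": ("Archaea", "P4"),
          "e1": ("Eucaryota", "P5"), "e2": ("Eucaryota", "P6")}
# Swap one bacterial and one archaeal leaf to break convexity at domain rank.
NONCONVEX = dict(CONVEX, b2=CONVEX["a1"], a1=CONVEX["b2"])

print(delicate_roots(ADJ, CONVEX))      # ['x']
print(delicate_roots(ADJ, NONCONVEX))   # ['a', 'b', 'e', 'x']
```

In the convex labeling only the central node attains subrank domain, while in the nonconvex labeling every internal node ties, which is the fragility that motivates the "delicate" terminology.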

We now develop a more robust definition of taxonomic root, which will require 
several definitions. Think of edges of the tree as being unordered pairs {x,y} of 
nodes. 

Definition 16. An arrow on an edge {x,y} is an ordered pair of nodes (x,y). The 
first node of the pair is called the origin of the arrow, and the second is called the 
direction. 

Definition 17. An arrow tree (T, A) is an ordered pair consisting of a tree T and 
a set of arrows A on the edges of T. A complete arrow tree is an arrow tree such 
that for every node x of the tree there is some arrow in A with origin x. 

Note that (x,y) and (y,x) may both be part of an arrow set for a tree with an 
edge {x,y}. We will use "pointing towards" and "pointing away" in their usual 
senses as they relate to arrows in the real world. 
Definition 18. The induced arrow tree $(T, A)$ for a tree $T$ and a rank function $\mathrm{rk}$ 
is a complete arrow tree defined as follows. For a given internal node $x \in N(T)$, 
say $\{S_1, \ldots, S_n\} = \Psi(x; T)$ and let $r_i = \mathrm{rk}(S_i)$ for $1 \le i \le n$. Assume without loss 
of generality that $r_1 \le r_2 \le \cdots \le r_n$. There is some minimal $1 \le j \le n$ such that 
$r_j = \cdots = r_n$. Let $A_x$ be 

$$\{(x, y) \mid y \text{ is the root of one of } S_j, \ldots, S_n\}.$$

Then $A$ is the union of the $A_x$ for all nodes $x$. 

Intuitively, induced arrows point towards potential taxonomic roots. 
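The construction of the induced arrow set can be sketched in Python as below. The sketch assumes an adjacency-list tree, hypothetical lineage tuples on the leaves, and a toy `rk` based on shared lineage prefixes; the name `induced_arrows` is introduced here for illustration only. For each node it ranks the subtrees obtained by deleting that node and emits an arrow towards every subtree of maximal rank.

```python
LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def rk(lineages):
    """Toy rank: 7 minus the number of leading names shared by all lineages."""
    shared = 0
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        shared += 1
    return len(LEVELS) - shared


def adjacency(edges):
    """Undirected adjacency lists for an unrooted tree given as edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def leaves_beyond(adj, x, y):
    """Leaves of the subtree that contains neighbor y once x is deleted."""
    stack, seen, leaves = [y], {x, y}, []
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:
            leaves.append(node)
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return leaves


def induced_arrows(adj, lineage_of):
    """Arrow set A of Definition 18 as a set of (origin, direction) pairs."""
    arrows = set()
    for x in adj:
        # Rank of the subtree on the far side of each neighbor y of x.
        ranks = {y: rk([lineage_of[leaf] for leaf in leaves_beyond(adj, x, y)])
                 for y in adj[x]}
        top = max(ranks.values())
        # Arrows go from x to every neighbor whose subtree has maximal rank.
        arrows |= {(x, y) for y, r in ranks.items() if r == top}
    return arrows


EDGES = [("x", "b"), ("x", "a"), ("x", "e"),
         ("b", "b1"), ("b", "b2"), ("a", "a1"), ("a", "a2"),
         ("e", "e1"), ("e", "e2")]
ADJ = adjacency(EDGES)
# Hypothetical domain/phylum labels; the phylum names are placeholders.
LINEAGE_OF = {"b1": ("Bacteria", "P1"), "b2": ("Bacteria", "P2"),
              "a1": ("Archaea", "P3"), "a2": ("Archaea", "P4"),
              "e1": ("Eucaryota", "P5"), "e2": ("Eucaryota", "P6")}
ARROWS = induced_arrows(ADJ, LINEAGE_OF)

print(sorted(a for a in ARROWS if a[0] == "x"))  # [('x', 'a'), ('x', 'b'), ('x', 'e')]
print(("b", "x") in ARROWS)                      # True
```

Every node of the tree, leaves included, receives at least one outgoing arrow, so the result is a complete arrow tree in the sense of Definition 17.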




Figure 7. Illustration of Lemma 1. 

Lemma 1. Say $(T, A)$ is an induced arrow tree for a rank function $\mathrm{rk}$, and that 
$\{x, y\}$ and $\{y, z\}$ are adjacent edges of $T$. If $(y, z) \in A$ then $(x, y) \in A$. 

Proof. When $x$ is a leaf, $(x, y) \in A$ is automatic, thus assume it is not. Using 
terminology from Figure 7, because $(y, z) \in A$, 

$$\mathrm{rk}(R_1 \cup \cdots \cup R_k) \le \mathrm{rk}(U).$$

This implies that 

$$\mathrm{rk}(R_i) \le \mathrm{rk}(U) \le \mathrm{rk}(S_1 \cup \cdots \cup S_\ell \cup U)$$

and thus that $(x, y) \in A$. □ 

Induction on the edges of a path shows the following: 

Corollary 1. Say $(T, A)$ is an induced arrow tree, and that $\{u, v\}$ and $\{x, y\}$ are 
edges of $T$ such that the path from $u$ to $y$ contains both $v$ and $x$. If $(x, y) \in A$ then 
$(u, v) \in A$. □ 

Informally, this corollary says that any time there is an arrow on edge $e_2$ pointing 
away from edge $e_1$, there must be an arrow on $e_1$ pointing towards $e_2$. 

Definition 19. A multi arrow node (MAN) for a taxonomically labeled tree is a 
node x such that there are two or more arrows in the induced arrow tree with x as 
an origin. 

Proposition 2. Say (T,A) is an induced arrow tree. If node x is a MAN then for 
any node y there must be an arrow in A with origin y pointing towards x. 

Proof. Since x is a MAN, there must be at least one arrow in A with origin x 
pointing away from y. This implies the proposition by Corollary 1. □ 
Now imagine that x and z are two MANs, and y is on the path between x and 
z. By the above proposition, there must be arrows with origin y pointing towards 
both x and z, showing that y will be a MAN. Thus: 

Proposition 3. MANs form a connected set. □ 

Definition 20. An edge {x, y} is a bi-arrow edge of an arrow set A if (x, y) and 
(y, x) are in A. 

Proposition 4. If the set of MANs is empty, then there is exactly one bi-arrow 
edge. 

Proof. First note that there cannot be two or more bi-arrow edges when the set 
of MANs is empty; in that case by Corollary 1 there would have to be a MAN 
between them. Now assume there are no bi-arrow edges. Since the set of MANs is 
empty, then for every leaf of the tree the sequence of nodes determined by following 
arrows is well defined. Note that the arrow on every leaf is pointing into the 
interior of the tree, and thus the sequence of nodes starting from an arbitrary leaf 
cannot hit another leaf. Therefore the sequence of nodes must backtrack somewhere, 
contradicting that there are no bi-arrow edges. □ 

Definition 21. Assume a taxonomically labeled tree T. If there is at least one 
MAN then define the set of taxonomic roots to be the set of MANs. Otherwise 
define it to be the set of nodes of the bi-arrow edge. 

Let diam(T) be the node-diameter of T, i.e. the number of steps from edge to 
edge required to traverse the tree. Because every arrow with a non-root origin 
points in the direction of the taxonomic roots: 

Proposition 5. A taxonomic root for a tree T with n leaves can be found in at 
most diam(T) steps. □ 
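Definition 21 and the arrow-following argument behind Proposition 5 can be sketched in Python as follows. The sketch takes an induced arrow set as given, written out by hand for two small hypothetical trees; the function names `taxonomic_roots` and `walk_to_root` are introduced here for illustration and are not the interface of any existing program.

```python
from collections import defaultdict


def taxonomic_roots(arrows):
    """Taxonomic roots (Definition 21) for a set of (origin, direction) arrows."""
    directions = defaultdict(set)
    for origin, direction in arrows:
        directions[origin].add(direction)
    # Multi arrow nodes: two or more arrows share the node as origin.
    mans = {x for x, dirs in directions.items() if len(dirs) >= 2}
    if mans:
        return mans
    # No MANs: Proposition 4 guarantees exactly one bi-arrow edge.
    bi_arrow_edges = {frozenset(a) for a in arrows if (a[1], a[0]) in arrows}
    (edge,) = bi_arrow_edges  # raises ValueError unless there is exactly one
    return set(edge)


def walk_to_root(arrows, start):
    """Follow arrows from start until a taxonomic root is reached.

    Assumes arrows is an induced arrow set, so every non-root node has
    exactly one outgoing arrow and that arrow points towards the roots.
    """
    roots = taxonomic_roots(arrows)
    next_node = dict(arrows)  # only looked up for non-root nodes
    node, steps = start, 0
    while node not in roots:
        node = next_node[node]
        steps += 1
    return node, steps


# Hand-written arrow set: three convex subtrees around a central node x.
THREE_DOMAINS = {("b1", "b"), ("b2", "b"), ("b", "x"),
                 ("a1", "a"), ("a2", "a"), ("a", "x"),
                 ("e1", "e"), ("e2", "e"), ("e", "x"),
                 ("x", "b"), ("x", "a"), ("x", "e")}
# Hand-written arrow set: a four-leaf tree whose internal edge {u, v} is bi-arrow.
QUARTET = {("b1", "u"), ("b2", "u"), ("u", "v"),
           ("a1", "v"), ("a2", "v"), ("v", "u")}

print(sorted(taxonomic_roots(THREE_DOMAINS)))  # ['x']
print(walk_to_root(THREE_DOMAINS, "b1"))       # ('x', 2)
print(sorted(taxonomic_roots(QUARTET)))        # ['u', 'v']
print(walk_to_root(QUARTET, "a2"))             # ('v', 1)
```

The walk visits each node at most once on its way to the roots, which is why the number of steps is bounded by the diameter of the tree.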

3.1. Computer implementation. Taxonomic rerooting has been implemented 
in the rppr binary of the pplacer suite of programs (http://matsen.fhcrc.org/pplacer/). 
However, rather than finding all possible taxonomic roots as described 
above, the program reports one of the roots after applying the maximal subcoloring 
algorithm as described in the previous section to the highest multiply occupied tax- 
onomic rank. Such a root is the closest approximation to the one "best" taxonomic 
root in the presence of nonconvexity. 

4. Conclusions and future work 

We have formalized the question of describing the discordance of a phylogenetic 
tree with its taxonomic classifications in terms of a convex subcoloring problem 
previously described in the literature. This coloring problem has some elegant so- 
lutions for the general case, but the parameter regime of interest here consists of 
trees of small degree and local nonconvexity. These considerations motivate a so- 
lution that solves a given recursion for as few "questions" as possible. The first 
component of this is to restrict attention to cut colors, resulting in a smaller base 
for the exponential complexity (Figure 4). The second is a branch and bound 
algorithm that gives a significant improvement in runtime compared to the algorithm 
in Theorem 1 (Figure 6). To enable this, the solutions are only built up "upon demand," 
that is, when a given question is asked. The implementation described here is the 
first of which we are aware, and certainly the first that conveniently integrates with 
taxonomic annotation. 

We also develop the first formalism for taxonomic rooting of phylogenetic trees, 
show that the original definition is useless in the presence of nonconvexity, and 
develop a more useful definition. A taxonomic root under this definition can be 
found in time linear in the diameter of the tree. 

We are currently developing a computational pipeline to reclassify sequences in 
public databases based on these algorithms. 

We are also using these algorithms together to develop a collection of 
automatically curated "reference packages" that bring together taxonomic and 
phylogenetic information for the purposes of environmental short read 
classification, visualization, and comparison. 